Workload Management System - 3.3.4
Current State :: Production
Contact :: cristina.aiftimiei@pd.infn.it
Technical Contact :: cristina.aiftimiei@pd.infn.it
Description :: The Workload Management System (WMS) comprises a set of grid middleware components responsible for the distribution and management of tasks across grid resources, in such a way that applications are executed conveniently, efficiently and effectively.
Release Date :: 20111215
Major Version :: 3
Minor Version :: 3
Revision Version :: 4
Release Notes ::

What's new

This is an update of the WMS that mainly fixes the following:
- A problem with the proxy purger cron job which, on ext3, prevented new delegations for a given DN after its 31999th submission
- A glitch in querying data catalogues that was also present in the gLite versions
- Wrong ownership of the /var and /var/log directories, which were owned by the glite user. Please note that the fix only applies to new installations; in case of update the ownership of these directories must be fixed manually (see below)
- Issues with sandbox purging by the LogMonitor
- A catch in ICE concerning job status change detection
- GLUE2 publication, which also allows publishing the WMS and EMI middleware release
- JeMalloc, an optimized memory allocator, which is now automatically installed and used by the WM module

This update also introduces the Nagios probe for the WMS. Documentation is available at WMSProbe.

Installation and configuration

Yaim (re)configuration is required after installation/update. In case of co-location with the L&B, make sure that yaim is also run for the latter.

In case of update, stop all the services before applying the update. After the update (i.e. after yum update and after yaim configuration):
- set the ownership of the directories /var and /var/log to root.root
- execute the cron job /etc/cron.d/glite-wms-create-host-proxy.cron
In case of a clean install, after yaim configuration:
- execute the cron job /etc/cron.d/glite-wms-create-host-proxy.cron
A sketch of the update procedure is given at the end of this section.

In the configuration file /etc/glite-wms/glite_wms.conf the following two parameters of the load_monitor must always have the same values, even if different from the default ones:

  jobSubmit = "${WMS_LOCATION_SBIN}/glite_wms_wmproxy_load_monitor --oper jobSubmit --load1 22 --load5 20 --load15 18 --memusage 99 --diskusage 95 --fdnum 1000 --jdnum 150000 --ftpconn 300";
  jobRegister = "${WMS_LOCATION_SBIN}/glite_wms_wmproxy_load_monitor --oper jobRegister --load1 22 --load5 20 --load15 18 --memusage 99 --diskusage 95 --fdnum 1000 --jdnum 150000 --ftpconn 300";

The admin guide has been updated with the following information:
- The location of the drain file has changed with respect to the gLite version
- Configuration tips in case of co-location with the L&B
Check out the guide for more details.
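The following shell sketch summarizes the update procedure described above for a standard EMI/UMD installation driven by yaim. The yaim node type, the site-info.def path and the stop/start details are assumptions and may differ for your deployment; refer to the admin guide for the authoritative steps.

  # Minimal sketch of the update procedure (paths and node types are assumptions)
  # 1) stop all the WMS services (exact init-script names depend on your deployment)
  # 2) update the packages
  yum update
  # 3) reconfigure with yaim; run it for the L&B node type as well if co-located
  /opt/glite/yaim/bin/yaim -c -s /root/site-info.def -n WMS
  # 4) restore the correct ownership of /var and /var/log (update case only)
  chown root:root /var /var/log
  # 5) create the host proxy right away by running, once by hand, the command
  #    contained in /etc/cron.d/glite-wms-create-host-proxy.cron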
Known issues

- As said above, this update fixes a problem on ext3 with the proxy purger that prevented new delegations for a given user after her 31999th submission. Even with this fix, some minor issues remain on ext3:
  a) there can be at most 31999 different users submitting to a given WMS (very unlikely);
  b) there can be at most 31999 valid (i.e. not expired) proxies for a given user (expired proxy files are purged on the WMS by a cron job which runs every 6 hours).
  To avoid these issues for good, ext4 is needed. Nonetheless, re-creating new delegations each time instead of reusing them is a bad and unsupported practice, so what is mentioned here only for the sake of clarity should really be considered a non-issue rather than a known issue.
- In case of WMS+L&B co-location, the error "no state in DB" on glite-wms-job-submit means that the WMS has not been authorized in the L&B authorization file. Refer to the "Installation and configuration" section above to fix this issue.
- If there are problems with purging, or the first output retrieval gives "Warning - JobPurging not allowed (CA certificate verification failed)", the host proxy certificate has not been installed. Refer to the "Installation and configuration" section above to fix this issue.
- Under very high load (e.g. submitting 1k/2k collections of 25 nodes with a frequency of 60 seconds), the following call to the L&B may take a non-trivial amount of time (wmproxy.log): "WMPEventlogger::registerSubJobs": Registering DAG subjobs to LB Proxy. If this time exceeds the mod_fcgid IPCCommTimeout parameter, the request is terminated, with the consequence of leaving the collection pending forever. Other than increasing IPCCommTimeout (see the sketch after this list), check that the L&B is properly doing its purging. Especially when in 'both' mode, the admin can act on the purging policy to make it more frequent. In proxy mode, the admin can even decide to turn off the automatic purging of jobs in a terminal state (-G option of glite-lb-bkserverd).
- The "job feedback", a newly introduced feature to replan jobs stuck in blocking queues, is far from perfect at this stage. Its use is nevertheless encouraged, precisely in order to give us feedback. The mechanism relies on the existence of a global synchronization token and therefore requires the shallow resubmission feature to be enabled; for this reason it must not be used with deep resubmission enabled.
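As a rough illustration of the IPCCommTimeout adjustment mentioned above, the shell sketch below locates the directive in the Apache configuration and reloads the web server; the /etc/httpd path and the 300-second value are assumptions, not recommended settings.

  # Find where mod_fcgid sets IPCCommTimeout for the WMProxy (path is an assumption)
  grep -ri IPCCommTimeout /etc/httpd/
  # Raise the value in the file found above, e.g.:
  #   IPCCommTimeout 300
  # then reload Apache so that the new timeout is picked up
  service httpd graceful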
Additional Details :: https://wiki.egi.eu/wiki/UMD-1:UMD-1.5.0#emi.wms.sl5.x86_64
Change LOG ::
Repository URL :: sw/production/umd/1/sl5/x86_64/updates
Documentation Links ::
Keywords ::